In part 2, we trained and logged a problematic model, and then reverted the commit to restore a good version.
Now we'll train an even better model, one that can also classify tweets in German, but this time using a separate branch and merge instead of committing directly to master.
This workflow requires verta>=0.14.1 and spaCy>=2.0.0.
Instead of spaCy's English model, we'll be building off of a multilingual model.
In [1]:
!python -m spacy download xx_ent_wiki_sm
Then, as before, import the libraries we'll need...
In [2]:
from __future__ import unicode_literals, print_function
import boto3
import json
import numpy as np
import pandas as pd
import spacy
...and instantiate Verta's ModelDB Client.
In [3]:
from verta import Client
client = Client('https://app.verta.ai')
proj = client.set_project('Tweet Classification')
expt = client.set_experiment('SpaCy')
Our multilingual model needs German training data to classify German tweets, so we'll download two datasets from S3.
Before, we trained a model on just english-tweets.csv. Now, we're also going to train with german-tweets.csv.
In [4]:
S3_BUCKET = "verta-starter"
EN_S3_KEY = "english-tweets.csv"
EN_FILENAME = EN_S3_KEY
DE_S3_KEY = "german-tweets.csv"
DE_FILENAME = DE_S3_KEY
boto3.client('s3').download_file(S3_BUCKET, EN_S3_KEY, EN_FILENAME)
boto3.client('s3').download_file(S3_BUCKET, DE_S3_KEY, DE_FILENAME)
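(If you don't have AWS credentials configured locally, and assuming the verta-starter bucket permits anonymous reads, an unsigned boto3 client is one way to fetch the same files; treat this as a sketch rather than part of the original workflow.)
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# anonymous (unsigned) S3 client -- only works if the bucket allows public reads
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
s3.download_file(S3_BUCKET, EN_S3_KEY, EN_FILENAME)
s3.download_file(S3_BUCKET, DE_S3_KEY, DE_FILENAME)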
In [5]:
import utils
en_data = pd.read_csv(EN_FILENAME)
de_data = pd.read_csv(DE_FILENAME)
data = pd.concat([en_data, de_data], axis=0)
data = data.sample(frac=1).reset_index(drop=True)
utils.clean_data(data)
data.head()
Out[5]:
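clean_data() comes from the tutorial's bundled utils module, so its exact behavior isn't shown here. As a rough sketch of the kind of in-place cleaning it might perform (the 'text' column name and the specific rules are assumptions, not the tutorial's actual code):
def clean_data(data):
    # hypothetical sketch: drop empty rows and normalize tweet text in place
    data.dropna(inplace=True)
    data['text'] = (
        data['text']
        .str.replace(r'http\S+', '', regex=True)  # strip URLs
        .str.replace(r'@\w+', '', regex=True)     # strip @mentions
        .str.strip()
    )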
As before, we'll capture and log our model ingredients. Note that we're now logging both of our datasets from S3.
In [6]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python
code_ver = Notebook() # Notebook & git environment
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3([
    "s3://{}/{}".format(S3_BUCKET, EN_S3_KEY),
    "s3://{}/{}".format(S3_BUCKET, DE_S3_KEY),
])
env_ver = Python() # pip environment and Python version
But instead of committing directly to master, we'll check out and commit to a separate branch.
In [7]:
repo = client.set_repository('Tweet Classification')
commit = repo.get_commit(branch='master').new_branch('multilingual')
In [8]:
commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)
commit.save("Support German tweets")
commit
Out[8]:
You may verify through the Web App that this commit, on branch multilingual, updates the dataset as well as the Notebook.
Again, as before, we'll train the model and log it along with the commit to an Experiment Run.
In [9]:
nlp = spacy.load('xx_ent_wiki_sm')
In [10]:
import training
training.train(nlp, data, n_iter=20)
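Like utils, the training module is bundled with the tutorial, so train() isn't reproduced here. Below is a condensed sketch of the spaCy 2.x text-categorizer loop it could resemble; the 'text'/'sentiment' columns and the 'POSITIVE' label are assumptions, not the tutorial's actual code.
import random
from spacy.util import minibatch, compounding

def train(nlp, data, n_iter=20):
    # hypothetical sketch: add a text categorizer and train it on (text, label) pairs
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    else:
        textcat = nlp.get_pipe('textcat')
    textcat.add_label('POSITIVE')

    train_data = [
        (text, {'cats': {'POSITIVE': bool(label)}})
        for text, label in zip(data['text'], data['sentiment'])
    ]

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):  # train only the text categorizer
        optimizer = nlp.begin_training()
        for _ in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            for batch in minibatch(train_data, size=compounding(4., 32., 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)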
In [11]:
run = client.set_experiment_run()
run.log_model(nlp)
In [12]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)
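Having logged the commit against the run, we can recover this association later from the run itself; run.get_commit() should return the commit together with the key-to-path mapping used above, though treat the exact return shape as an assumption:
# retrieve the commit and key paths associated with this run
logged_commit, key_paths = run.get_commit()
print(key_paths['training_data'])  # "data/tweets"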
Our model seems to be handling our multilingual data just fine, so we'll merge our improvements into master.
In [13]:
commit
Out[13]:
In [14]:
master = repo.get_commit(branch="master")
master
Out[14]:
In [15]:
master.merge(commit)
master
Out[15]:
Now we've merged multilingual into master, bringing in our verified changes.
Again, the Web App will show this merge commit on master updating the dataset and the Notebook.